Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
The rapid development of large language and vision models (LLVMs) has been driven by advances in visual instruction tuning. Recently, open-source LLVMs have curated high-quality visual instruction tuning datasets and utilized additional vision encoders or multiple computer vision models in order to narrow the performance gap with powerful closed-source LLVMs. These advancements are attributed to multifaceted information required for diverse capabilities, including fundamental image understanding, real-world knowledge about common-sense and non-object concepts (e.g., charts, diagrams, symbols, signs, and math problems), and step-by-step procedures for solving complex questions. Drawing from the multifaceted information, we present a new efficient LLVM, Mamba-based traversal of rationales (Meteor), which leverages multifaceted rationale to enhance understanding and answering capabilities. To embed lengthy rationales containing abundant information, we employ the Mamba architecture, capable of processing sequential data with linear time complexity.
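To make the linear-time claim concrete, here is a minimal sketch of the selective-scan recurrence that underlies Mamba-style state-space models, written as a single pass over the sequence. The diagonal state matrix A, the input-dependent step sizes delta, and all shapes are illustrative assumptions, not the Meteor implementation.

```python
# A toy selective-scan: a discretized diagonal state-space model run over a
# sequence in one pass, so the cost grows linearly with sequence length L.
import numpy as np

def selective_scan(x, A, B, C, delta):
    """Run a discretized diagonal SSM over a sequence in O(L) time.

    x:     (L, D) input sequence (e.g. embedded rationale tokens)
    A:     (D, N) diagonal state dynamics (negative entries for stability)
    B, C:  (L, N) input-dependent projections
    delta: (L, D) input-dependent step sizes
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                 # hidden state carried across steps
    y = np.empty((L, D))
    for t in range(L):                   # one pass: cost linear in L
        dA = np.exp(delta[t][:, None] * A)       # discretized dynamics
        dB = delta[t][:, None] * B[t][None, :]   # discretized input map
        h = dA * h + dB * x[t][:, None]          # state update
        y[t] = h @ C[t]                          # readout
    return y

# Toy usage: a 4096-token "rationale" is processed in a single linear pass.
rng = np.random.default_rng(0)
L, D, N = 4096, 64, 16
y = selective_scan(rng.standard_normal((L, D)),
                   -np.abs(rng.standard_normal((D, N))),
                   rng.standard_normal((L, N)),
                   rng.standard_normal((L, N)),
                   0.1 * np.abs(rng.standard_normal((L, D))))
print(y.shape)  # (4096, 64)
```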
UlcerGPT: A Multimodal Approach Leveraging Large Language and Vision Models for Diabetic Foot Ulcer Image Transcription
Reza Basiri, Ali Abedi, Chau Nguyen, Milos R. Popovic, Shehroz S. Khan
Diabetic foot ulcers (DFUs) are a leading cause of hospitalizations and lower limb amputations, placing a substantial burden on patients and healthcare systems. Early detection and accurate classification of DFUs are critical for preventing serious complications, yet many patients experience delays in receiving care due to limited access to specialized services. Telehealth has emerged as a promising solution, improving access to care and reducing the need for in-person visits. The integration of artificial intelligence and pattern recognition into telemedicine has further enhanced DFU management by enabling automatic detection, classification, and monitoring from images. Despite advancements in artificial intelligence-driven approaches for DFU image analysis, the application of large language models for DFU image transcription has not yet been explored. To address this gap, we introduce UlcerGPT, a novel multimodal approach leveraging large language and vision models for DFU image transcription. This framework combines advanced vision and language models, such as Large Language and Vision Assistant and Chat Generative Pre-trained Transformer, to transcribe DFU images by jointly detecting, classifying, and localizing regions of interest. Through detailed experiments on a public dataset, evaluated by expert clinicians, UlcerGPT demonstrates promising results in the accuracy and efficiency of DFU transcription, offering potential support for clinicians in delivering timely care via telemedicine.
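For readers curious what such a pipeline might look like, below is a minimal sketch of the first (vision-language) stage using a public LLaVA checkpoint via Hugging Face transformers. The model ID, prompt wording, and requested output fields are assumptions for illustration, not the UlcerGPT implementation.

```python
# Stage 1 of a two-stage transcription pipeline: a vision-language model
# drafts a structured description of the wound image.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # public LLaVA checkpoint (assumption)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def transcribe_dfu(image_path: str) -> str:
    """Ask the VLM to detect, classify, and localize regions of interest."""
    image = Image.open(image_path).convert("RGB")
    prompt = (
        "USER: <image>\n"
        "Describe this diabetic foot ulcer image: state whether an ulcer is "
        "present, its approximate location on the foot, and the visible "
        "tissue types (granulation, slough, necrosis). ASSISTANT:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(out[0], skip_special_tokens=True)
```

The draft transcription could then be handed to a chat model for clinical-style refinement, mirroring the GPT stage the paper describes.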
- North America > Canada > Ontario > Toronto (0.15)
- North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.04)
- Europe (0.04)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.46)
- Health & Medicine > Health Care Technology > Telehealth (1.00)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.76)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
LLVMs4Protest: Harnessing the Power of Large Language and Vision Models for Deciphering Protests in the News
Large language and vision models have transformed how social movement scholars identify protest and extract key protest attributes from multi-modal data such as texts, images, and videos. This article documents how we fine-tuned two large pretrained transformer models, Longformer and Swin Transformer V2, to infer potential protests in news articles using textual and imagery data. First, the Longformer model was fine-tuned on the Dynamics of Collective Action (DoCA) corpus: we matched New York Times articles against the DoCA database to obtain a training dataset for the downstream task. Second, the Swin Transformer V2 model was trained on the UCLA-protest imagery data, which provides images labeled with attributes such as protest, violence, and sign. Both fine-tuned models will be available via https://github.com/Joshzyj/llvms4protest. We release this short technical report for social movement scholars interested in using LLVMs to infer protests in textual and imagery data.
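As a concrete illustration of the text branch, here is a minimal fine-tuning sketch using the Hugging Face Trainer. The toy training examples, the binary label scheme (1 = protest), and the hyperparameters are assumptions for exposition, not the released LLVMs4Protest code.

```python
# Fine-tune Longformer for binary protest detection in long news articles.
from datasets import Dataset
from transformers import (AutoTokenizer, LongformerForSequenceClassification,
                          Trainer, TrainingArguments)

CHECKPOINT = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = LongformerForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=2  # protest vs. no protest
)

# Hypothetical examples; the paper instead matches NYT articles to DoCA.
train = Dataset.from_dict({
    "text": ["Thousands marched downtown demanding police reform.",
             "The city council approved next year's budget on Tuesday."],
    "label": [1, 0],
})
train = train.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llvms4protest-longformer",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=train,
    tokenizer=tokenizer,  # enables padded batching via the default collator
)
trainer.train()
```

The imagery branch would follow the same pattern with an image classifier such as Swin Transformer V2 trained on the UCLA-protest labels.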
- North America > United States > New York > Suffolk County > Stony Brook (0.05)
- North America > United States > California > Santa Clara County > Palo Alto (0.05)
- Europe > Switzerland (0.05)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.98)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.31)